Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) #22522
aendk wants to merge 31 commits into ggml-org:master
Conversation
Hi @aendk, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
ORippler left a comment
Given this opens up the possibility for hard-to-catch data races, I feel we should make this toggleable at run time rather than a compile-time feature, to facilitate easier debugging and a guaranteed functionally correct path should a bug ever occur. Also, isn't PDL effectively a no-op on CC < 9.0 devices? If so, we can simply always compile it and rely on the run-time toggle (i.e. remove the cmake flag).
Also, please clean up leftover comments before marking this as ready for review.
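For what it's worth, a run-time toggle could be as small as a cached environment-variable check. A rough sketch; the variable name `GGML_CUDA_DISABLE_PDL` and the helper below are hypothetical, not part of this PR:

```cpp
#include <cstdlib>

// Hypothetical run-time toggle: PDL stays on unless the user sets
// GGML_CUDA_DISABLE_PDL to something other than "0". Name and mechanism
// are illustrative only.
static bool ggml_cuda_pdl_enabled() {
    static const bool enabled = [] {
        const char * v = std::getenv("GGML_CUDA_DISABLE_PDL");
        return v == nullptr || v[0] == '\0' || v[0] == '0';
    }();
    return enabled;
}
```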
#define GGML_CUDA_CC_TURING 750
#define GGML_CUDA_CC_AMPERE 800
#define GGML_CUDA_CC_ADA_LOVELACE 890
#define GGML_CUDA_CC_HOPPER 900
# define GGML_CUDA_PDL_SYNC() cudaGridDependencySynchronize()
# define GGML_CUDA_PDL_LC() cudaTriggerProgrammaticLaunchCompletion()
For transpilation we need to add the corresponding aliases for Musa/Hip (or guard this to be CUDA-only for now if these aliases are absent)
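One possible shape for that guard is to expand the markers to no-ops wherever the CUDA intrinsics are unavailable. A sketch; the exact guard conditions for the HIP/MUSA paths are an assumption, not taken from this PR:

```cpp
// Sketch: compile the PDL markers to no-ops on back-ends that lack the
// CUDA intrinsics; the guard conditions are assumed, not the PR's code.
#if defined(GGML_CUDA_PDL) && !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
#    define GGML_CUDA_PDL_SYNC() cudaGridDependencySynchronize()
#    define GGML_CUDA_PDL_LC()   cudaTriggerProgrammaticLaunchCompletion()
#else
#    define GGML_CUDA_PDL_SYNC() do {} while (0)
#    define GGML_CUDA_PDL_LC()   do {} while (0)
#endif
```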
GGML_CUDA_PDL_SYNC();
// GGML_CUDA_PDL_LC(); // FATTN_VEC try 2; on maxq
const uint3 neqk1_magic,
const uint3 rq3_magic,
float scale) {
// GGML_CUDA_PDL_LC(); // GATED_DELTA_NET try 1; always followed by memcpy on qwen3.5, no benefit
constexpr int experts_per_thread = (n_experts > WARP_SIZE) ? n_experts / WARP_SIZE : 1;
float wt[experts_per_thread];
float wt_sum = 0.f;
float output_weights[experts_per_thread];
Those are per se not data accesses, so there is no need to move them. Did you see actual perf gains for this?
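To illustrate the point in a toy kernel (not the kernel from this diff; it assumes this PR's `GGML_CUDA_PDL_SYNC()` macro is in scope): register-only declarations carry no dependency on the previous kernel, so only the first read of upstream data has to sit below the barrier.

```cpp
// Illustrative sketch only: local declarations above the barrier,
// the first read of data produced by the previous kernel below it.
static __global__ void weights_like(const float * logits, float * out, const int n) {
    float wt[4];           // local declaration: registers only, no data access
    float wt_sum = 0.0f;   // also fine before the barrier

    GGML_CUDA_PDL_SYNC();  // wait for the producer kernel's writes to logits

    const int i0 = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
#pragma unroll
    for (int j = 0; j < 4; ++j) {
        wt[j]   = i0 + j < n ? logits[i0 + j] : 0.0f;  // first real data access
        wt_sum += wt[j];
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = wt_sum;
}
```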
Overview
Programmatic Dependent Launch (PDL) is a CUDA optimization for newer NVIDIA GPUs (CC >= 9.0; does not include Ada).
It enables overlapping execution of CUDA kernels on the same CUDA stream. Like CUDA graphs, it reduces kernel launch overhead on the device. The benefits of both are additive (PDL + CG > CG > PDL).
This can best be seen visually in an Nsight Systems trace of a single CUDA stream: kernels which would normally be strictly ordered are run concurrently.
PDL was already proposed last year in #15479.
This PR integrates better into the CUDA graph semantics, and has vastly better performance. On an RTX PRO 6000, a token generation phase speedup of 10% is not unusual, on DGX Spark, I've seen 4-5% improvement (model dependent, see detailed stats below).
For full PDL performance, kernels need to be equipped with two new features: a synchronization barrier (`GGML_CUDA_PDL_SYNC`) and a launch signal (`GGML_CUDA_PDL_LC`). The synchronization barrier makes the kernel wait for the data written by the preceding kernel, so that no race conditions or premature data accesses take place. The launch signal indicates at which point the current kernel can tolerate the start of the next kernel alongside it. Additionally, kernels need to be launched via the new `ggml_cuda_kernel_launch()` function.

The synchronization barrier can be placed by carefully inspecting the kernel code and identifying the first "real" data access (e.g. excluding pointer arithmetic) of the kernel input. The launch signal placement requires a bit of hand-tuning and benchmarking. In this draft PR, I enrolled all kernels used in gpt-oss 20b, qwen3.5 and nemotron 120B Super. Because these kernels are shared with other models, I've tested more models. I saw speed-ups in almost all models in the token generation phase, with prefill/context phases being mostly neutral.
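To make the placement rules concrete, a minimal sketch (not one of the kernels touched by this PR; it assumes the two macros are in scope):

```cpp
// Minimal elementwise kernel showing where the two PDL markers go.
static __global__ void scale_f32_pdl(const float * x, float * dst, const float scale, const int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;  // pure index math: fine before the barrier

    GGML_CUDA_PDL_SYNC();        // wait until the previous kernel's writes to x are visible

    if (i < n) {
        dst[i] = scale * x[i];   // first "real" data access of the kernel input
    }

    GGML_CUDA_PDL_LC();          // last write done: the next kernel in the stream may start launching
}
```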
Applied Heuristics:

- If a kernel's first access of upstream data happens before `GGML_CUDA_PDL_SYNC`, a data race could occur. Before marking this merge-ready, I will double check this again. When reviewing, this should be kept in mind.
- Placing `GGML_CUDA_PDL_LC` is a bit of trial and error. This is visible in some kernels where I've commented out some suboptimal placements in some commits. In some kernels, placing `GGML_CUDA_PDL_LC` is even perf-negative (most notably `mul_mat_vec_q`). Generally, the earlier the signal can be placed in a kernel, the more latency-limited that kernel is and the more shared-resource contention (due to the premature launch of the successive kernel) it can tolerate.
Further Info on this Implementation

- Models beyond the three listed above also benefit, as they share kernels with them (e.g. `quantize_q8` and `mul_mat_vec_q` are enrolled in PDL and are present in many models).
- The placement of the `GGML_CUDA_PDL_LC` flag is a bit of trial & error, but good placement for one model appears to be beneficial for other models, too. In internal testing, I did not run into settings which are, for example, beneficial for model A but worse for model B's performance.

Known issues/TODOs

- Double-check all placements of `GGML_CUDA_PDL_SYNC`.
- … `GGML_CUDA_CC_HOPPER` did not work.
How to test it

You need a newer NVIDIA GPU (e.g. Blackwell), and you need to compile with `-D GGML_CUDA_PDL=ON`.
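For context, this is what a PDL-enabled launch looks like with the plain CUDA runtime API; whether `ggml_cuda_kernel_launch()` in this PR wraps it exactly like this is an assumption, but the attribute shown is the standard mechanism (CUDA >= 11.8, takes effect on CC 9.0+):

```cpp
#include <cuda_runtime.h>

// Launch `kernel` so that it may overlap with the preceding kernel on the
// same stream (programmatic stream serialization, i.e. PDL).
template <typename... Args>
static cudaError_t launch_with_pdl(void (*kernel)(Args...), dim3 grid, dim3 block,
                                   size_t smem, cudaStream_t stream, Args... args) {
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim          = grid;
    cfg.blockDim         = block;
    cfg.dynamicSmemBytes = smem;
    cfg.stream           = stream;

    cudaLaunchAttribute attr[1];
    attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr[0].val.programmaticStreamSerializationAllowed = 1;  // enable PDL for this launch
    cfg.attrs    = attr;
    cfg.numAttrs = 1;

    return cudaLaunchKernelEx(&cfg, kernel, args...);
}
```

A kernel launched this way still has to call `GGML_CUDA_PDL_SYNC()` before touching data produced by its predecessor, which is the rule described in the enrollment steps below.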
How to enroll other kernels into PDL

- Launch the kernel via `ggml_cuda_kernel_launch()` and set `GGML_CUDA_PDL_SYNC()`. Modifying the kernel launch without setting the sync barrier leads to a race condition.
- Then place `GGML_CUDA_PDL_LC()`. My loose heuristic was to place it at the function start, measure performance, and then repeat the process for different locations in the middle of the kernel. I then picked the best performing placement. In my testing, placing it near the bottom of a kernel was almost always unproductive.

Let me know if you are able to test it! @ggerganov @JohannesGaessler @am17an @ORippler
Performance:
RTX PRO 6000
DGX Spark
Requirements